Skip to content

fix(memory): resolve dotted embedding provider refs against providers.models#8152

Open
OmkumarSolanki wants to merge 3 commits into
zeroclaw-labs:masterfrom
OmkumarSolanki:fix/7949-embedding-route-provider-resolution
Open

fix(memory): resolve dotted embedding provider refs against providers.models#8152
OmkumarSolanki wants to merge 3 commits into
zeroclaw-labs:masterfrom
OmkumarSolanki:fix/7949-embedding-route-provider-resolution

Conversation

@OmkumarSolanki

@OmkumarSolanki OmkumarSolanki commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Base branch: master
  • What changed and why:
    • [[embedding_routes]] (and [memory].embedding_provider) accept a dotted <type>.<alias> provider reference. It passes config validation, but the embeddings factory (create_embedding_provider) only matches openai / openrouter / custom: — so a ref like openai.default fell through to NoopEmbedding and silently degraded retrieval to keyword-only, with no error logged. The embedding-routes feature was effectively dead for any dotted ref (issue [Bug]: [[embedding_routes]] silently degrades to NoopEmbedding (route feature effectively dead) #7949).
    • Resolve the dotted ref against the canonical providers.models catalog when finalizing the embedding profile, reusing the existing ModelProviders::find_by_name: an explicit uri becomes custom:<uri>; with no uri, an openai/openrouter family passes through to the factory's built-in default endpoint.
    • Never silently Noop: a ref that does not resolve, or that resolves to a family with no usable embeddings endpoint (a non-openai/openrouter family configured without a uri, e.g. custom.<alias>), is left unresolved and logged loudly with a stable error_key.
    • Key precedence: per-route / [memory] override > the referenced provider's own key > inherited chat-seed key.
    • The catalog is threaded through create_memory_with_storage_and_routes from every runtime construction site — agent-scoped factory, gateway, daemon RPC, daemon heartbeat recall, memory CLI reindex, owned-state cascade — and read on demand at construction; no new persistent config field is introduced.
  • Scope boundary: Does NOT change config schema or validation, the EmbeddingProvider trait / OpenAiEmbedding HTTP client, non-dotted provider values (openai, openrouter, custom:<url>, none stay byte-for-byte identical), or any non-embedding provider resolution path. Does not add config keys or feature flags, and adds no struct fields.
  • Blast radius: zeroclaw-memory backend construction and its callers (gateway, runtime daemon incl. heartbeat, root binary CLI/alias paths). Only profiles whose embedding model_provider is a dotted ref change behavior; everything else is unchanged. create_memory_with_storage_and_routes gained one trailing Option<&ModelProviders> parameter (Beta-tier crate; all in-tree callers updated).
  • Linked issue(s): Closes [Bug]: [[embedding_routes]] silently degrades to NoopEmbedding (route feature effectively dead) #7949.
  • Suggested milestone: v0.8.3 (matches the issue; I lack permission to set it).
  • Labels: snapshot of suggested labels (I lack permission to set risk:/size:): bug, risk: high (per AGENTS.md path rules this touches crates/zeroclaw-runtime/src/** and crates/zeroclaw-gateway/src/**, which classify as high — overriding the issue's original risk: medium), size: M, memory, config, runtime, gateway, daemon, priority:p2.

Validation Evidence (required)

Environment: rustc 1.96.0 (ac68faa20 2026-05-25). Full battery, literal tails:

$ cargo fmt --all -- --check
# exit 0, no output (clean)

$ CARGO_INCREMENTAL=0 cargo clippy --all-targets -- -D warnings
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 52.85s
# exit 0; grep -cE '^(warning|error)' -> 0

$ CARGO_INCREMENTAL=0 cargo nextest run --locked --workspace --exclude zeroclaw-desktop
     Summary [  66.535s] 9434 tests run: 9434 passed, 11 skipped
  • Beyond CI — what did you manually verify?
    • 7 regression tests in zeroclaw-memory (19 resolve_embedding_config* tests total, all green): dotted route ref → custom:<uri> + provider key (with an end-to-end assert that the built embedder is "openai", i.e. real, not the "none" Noop); dotted ref without uri → bare openai/openrouter; route key overrides provider key; unresolvable ref left intact (not silent); dotted base embedding_provider resolves; resolved-but-no-endpoint family (custom.<alias> without uri) left unresolved (not silently mapped to a working-looking provider); custom profile with uri resolves.
    • Mutation-checks (break fix → prove test fails → restore): (1) forcing is_dotted_ref = false makes 4 dotted-resolution tests fail ("openai.default" vs custom:..., embedder name "none"); (2) restoring the old None => kind silent mapping makes the no-endpoint test fail (left: "custom" vs right: "custom.myembed").
    • Confirmed all 10 pre-existing precedence tests still pass unchanged (non-dotted providers pass None catalog → identical behavior). Verified feature-gated call sites compile under --features gateway,agent-runtime.
  • If any command was intentionally skipped, why: none skipped.

Security & Privacy Impact (required)

  • New permissions, capabilities, or file system access scope? No
  • New external network calls? No new endpoints — but note: a route that previously degraded to keyword-only will now contact the operator-configured embedding endpoint (the one named by the dotted ref) as intended. No URL is introduced by the code; it only reaches profiles the operator already configured in providers.models.
  • Secrets / tokens / credentials handling changed? Yes (scoped): when a dotted ref resolves, the referenced profile's api_key is read from providers.models and used for the embedding request, with precedence per-route/[memory] override > referenced provider key > inherited chat key. Keys are read on demand from canonical config; the WARN logs record only the provider reference string / family name, never the key.
  • PII, real identities, or personal data in diff, tests, fixtures, or docs? No (fixtures use api.example.com / embed.local / placeholder sk-*).

Compatibility (required)

  • Backward compatible? Yes. Non-dotted providers unchanged; dotted refs that were silently broken now work (or fail loudly when unsupported). No config migration needed.
  • Config / env / CLI surface changed? No user-facing surface change. Internal Rust API: zeroclaw_memory::create_memory_with_storage_and_routes gained a trailing providers: Option<&ModelProviders> parameter — Beta-tier crate, all in-tree callers updated in this PR.
  • Upgrade steps for existing users: none. Operators who configured [[embedding_routes]] with a dotted model_provider that points at an OpenAI-compatible profile will see embeddings start working; a route pointing at a family with no embeddings endpoint now logs a clear WARN instead of silently doing nothing.

Rollback (required for risk: medium and risk: high)

  • Fast rollback command/path: git revert 9c9497fd4 14fe60917 (two self-contained commits on this branch), or revert the squash-merge commit.
  • Feature flags or config toggles: None added. Operators can disable embeddings by setting [memory].embedding_provider = "none" / removing the route, exactly as before.
  • Observable failure symptoms: grep daemon logs for error_key=memory.embedding_route_unresolved (ref not in catalog), error_key=memory.embedding_route_no_endpoint (resolved family has no embeddings endpoint), or Embedding API error <status> (runtime failure from a newly-active operator endpoint). If embeddings misbehave, the revert restores the prior keyword-only fallback behavior.

….models

`[[embedding_routes]]` (and `[memory].embedding_provider`) accept a dotted
`<type>.<alias>` provider reference, which passes config validation but was
never resolved by the embeddings factory — `create_embedding_provider` only
matches `openai` / `openrouter` / `custom:`, so a ref like `openai.default`
fell through to `NoopEmbedding`, silently degrading retrieval to keyword-only
with no error logged. The embedding-routes feature was effectively dead for
any dotted provider ref.

Resolve the dotted ref against the canonical `providers.models` catalog when
finalizing the embedding profile: an explicit `uri` override becomes
`custom:<uri>`, otherwise the bare provider kind is passed through so the
factory applies its built-in family default endpoint. Key precedence is
per-route / `[memory]` override > the referenced provider's own key >
inherited chat key. An unresolvable ref is left intact and logged loudly
instead of degrading silently.

The catalog is threaded through `create_memory_with_storage_and_routes` from
every runtime call site (agent, gateway, daemon RPC, memory CLI, owned-state
cascade) and read on demand — no provider state is cached. Non-dotted
providers (`openai`, `openrouter`, `custom:<url>`, `none`) are unchanged.

Closes zeroclaw-labs#7949
@github-actions github-actions Bot added daemon Auto scope: src/daemon/** changed. gateway Auto scope: src/gateway/** changed. memory Auto scope: src/memory/** changed. runtime Auto scope: src/runtime/** changed. labels Jun 22, 2026
@OmkumarSolanki

OmkumarSolanki commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a follow-up (9c9497fd4) hardening a few cases I caught on a closer pass over the resolution path:

Resolved families with no embeddings endpoint no longer degrade silently.
find_by_name resolves any configured family, but the embeddings factory only builds openai / openrouter / custom:<url>. A uri-less custom.<alias> (or any other non-openai/openrouter family) was mapped to its bare kind and silently fell to NoopEmbedding, skipping the unresolved-reference warning. Now only openai/openrouter pass through bare; anything else with no uri is left unresolved and logged loudly (error_key=memory.embedding_route_no_endpoint). Added a mutation-checked regression test plus a positive custom-with-uri test.

Heartbeat recall now honors embedding routes.
run_heartbeat_worker previously used the bare create_memory entrypoint (empty routes, no catalog), so [[embedding_routes]] / dotted refs were ineffective there. Routed it through create_memory_with_storage_and_routes with config.embedding_routes + the catalog, matching the gateway/RPC paths.

Stable error_key on the diagnostics.
Added memory.embedding_route_unresolved and memory.embedding_route_no_endpoint to the WARN events.

Metadata. Suggesting risk: high (this touches runtime/** and gateway/**, which the path rules classify as high) and milestone v0.8.3 — I don't have label/milestone permissions on upstream, so those are snapshot suggestions.

One thing to flag for reviewers — live provider-state propagation.
OpenAiEmbedding holds its resolved base_url/api_key, and the long-lived install-wide RpcContext.memory is built once at boot, so a config/set that hot-updates a provider profile refreshes live chat sessions but not that memory handle until it is rebuilt. This is pre-existing embedder behavior — every memory embedding input ([memory].embedding_api_key, embedding_provider, dims) is already snapshotted at construction — and this PR adds no new persistent/duplicate config field (ResolvedEmbeddingConfig is a transient local; the new parameter is a borrow resolved on call). Making the embedder resolve provider state on demand from Arc<RwLock<Config>> would change the EmbeddingProvider contract across every embedding path and thread Arc<RwLock<Config>> into one-shot callers (memory reindex, alias cascade) that don't hold it — a larger, higher-risk change than this focused fix. I'd suggest tracking that as a follow-up (happy to take it on), but can fold it in here if preferred.

Validation after the changes:

  • cargo fmt --all -- --check: exit 0, clean.
  • CARGO_INCREMENTAL=0 cargo clippy --all-targets -- -D warnings: exit 0, 0 warnings.
  • CARGO_INCREMENTAL=0 cargo nextest run --locked --workspace --exclude zeroclaw-desktop: 9434 passed, 11 skipped.

…t routes

Follow-up hardening for the zeroclaw-labs#7949 fix:

- A dotted ref that resolves in `providers.models` to a family with no usable
  embeddings endpoint (a non-`openai`/`openrouter` family configured without a
  `uri`, e.g. `custom.<alias>`) was mapped to its bare kind, which the
  embeddings factory does not recognize — so it silently fell back to
  `NoopEmbedding`, bypassing the new unresolved-reference warning. Only pass
  `openai`/`openrouter` through bare; otherwise leave the reference unresolved
  and log loudly. A configured route now never degrades in silence.
- Route the daemon heartbeat recall through the routes-aware memory factory
  with the provider catalog, matching the gateway/RPC paths; previously it used
  the bare `create_memory` entrypoint (empty routes, no catalog) so
  `[[embedding_routes]]` / dotted refs were ineffective for heartbeat recall.
- Add stable `error_key` fields (`memory.embedding_route_unresolved`,
  `memory.embedding_route_no_endpoint`) to the diagnostic WARN events.

Adds regression tests for the resolved-but-no-endpoint case (mutation-checked)
and for a `custom` profile with an explicit `uri`.
@OmkumarSolanki OmkumarSolanki force-pushed the fix/7949-embedding-route-provider-resolution branch from 1021863 to 9c9497f Compare June 22, 2026 12:03
…routes

The keyword-only fallback assertion alone stays green even if the WARN is
removed — so it does not actually guard the "never silent" contract that
separates an acceptable loud fallback from the original zeroclaw-labs#7949 bug. Add a test
that captures the zeroclaw-log broadcast event and asserts severity = WARN plus
the stable `error_key = memory.embedding_route_no_endpoint` (with provider_ref /
provider_kind). Mutation-checked: deleting the WARN emission fails this test.
@OmkumarSolanki

Copy link
Copy Markdown
Contributor Author

Another follow-up (2aa4cd75f):

Diagnostic is now asserted, not just the fallback. Good catch — the ..._is_not_silent test only proved the keyword-only fallback and would have stayed green if the WARN were deleted, so it didn't actually guard the "never silent" contract. Added resolve_embedding_config_no_endpoint_emits_loud_warning, which captures the zeroclaw-log broadcast event and asserts severity_text == "WARN" and attributes.error_key == "memory.embedding_route_no_endpoint" (plus provider_ref / provider_kind). Mutation-checked by removing the WARN emission — the test then fails with expected a loud memory.embedding_route_no_endpoint WARN event.

On resolving provider endpoint/key from live config vs. construction-time materialization — I want to lay out why I've scoped this as a follow-up rather than fold it into this PR, and I'm happy to be overruled:

  • This PR adds no new persistent/duplicate config field; ResolvedEmbeddingConfig is a transient local and the new parameter is a borrow resolved per construction call. The duplicate-state forcing mechanism (dev/ci.sh field check) targets new stored fields and doesn't fire here.
  • The established pattern for provider credentials in long-lived runtime objects in this codebase is snapshot-at-build + refresh-on-config/set, not live per-request resolution: chat providers cache their resolved key/endpoint and are rebuilt by refresh_live_sessions_for_model_provider when a provider profile changes. Memory has no equivalent refresh hook — that's the real gap, and it predates this PR (every memory embedding input is already snapshotted at backend construction).
  • Making the embedder resolve live would also require reordering daemon startup: the install-wide memory is built (daemon/mod.rs ~L337) before the Arc<RwLock<Config>> that config/set mutates even exists (it's created ~L377 by cloning), and one-shot callers (memory reindex, alias cascade) hold only a &Config. So it's a cross-cutting change to the EmbeddingProvider contract + startup wiring + several call sites — meaningfully larger and riskier than this focused [Bug]: [[embedding_routes]] silently degrades to NoopEmbedding (route feature effectively dead) #7949 fix.

The correct long-term fix is either live resolution in the embedder or a memory-refresh hook on config/set mirroring the chat-session path. I'd prefer to land the focused route-resolution fix here (it makes a dead feature work and never degrade silently) and track live provider-state propagation as a follow-up — happy to file and take it on. If you'd rather it land here, I'll do the on-demand-resolver version with the "mutate the canonical entry, next embed observes it" test.

Validation after these changes:

  • cargo fmt --all -- --check: exit 0, clean.
  • CARGO_INCREMENTAL=0 cargo clippy --all-targets -- -D warnings: exit 0, 0 warnings.
  • CARGO_INCREMENTAL=0 cargo nextest run --locked --workspace --exclude zeroclaw-desktop: 9435 passed, 11 skipped.

@WareWolf-MoonWall WareWolf-MoonWall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #8152

fix(memory): resolve dotted embedding provider refs against providers.models
Author: OmkumarSolanki (Omkumar Solanki) | Head: 2aa4cd75f68abdd051ecae3e9b5a1cf5139943a5
Reviewer: WareWolf-MoonWall | Verdict: --approve


🟢 Resolution correctness — find_by_name splits on first dot; all edge cases handled

resolve_provider_ref correctly detects dotted refs via:

let is_dotted_ref =
    !trimmed.is_empty() && !trimmed.starts_with("custom:") && trimmed.contains('.');

custom: URLs are excluded (they contain a colon, not a dot leading to a provider alias). find_by_name in providers.rs splits on the first dot via name.split_once('.'), performs a typed map lookup by kind, and returns None on an unknown kind or missing alias.

Edge cases:

  • Missing provider ref: find_by_name returns None → logged at WARN with error_key=memory.embedding_route_unresolved, original dotted string preserved as model_provider, factory gets NoopEmbedding. No silent degradation.
  • Resolved family with no embeddings endpoint: only openai/openrouter pass through bare; any other family without a uri (e.g. a custom.<alias> without a uri) hits the None arm → logged at WARN with error_key=memory.embedding_route_no_endpoint. No silent degradation.
  • Resolved family with explicit uri: becomes custom:<uri>, which the embeddings factory handles as any OpenAI-compatible endpoint.
  • Circular refs: not possible — provider names are plain strings, not recursive structures.
  • Deeply nested refs: not applicable — find_by_name splits on first dot only, which is the correct <kind>.<alias> format. A ref like openai.default.extra would be treated as kind=openai, alias=default.extra, which would miss in the alias map and return None, logging loudly.

🟡 Warning — construction-time snapshot, not live resolution; acknowledged, follow-up needed

OpenAiEmbedding holds its resolved base_url/api_key at construction time. A long-lived install-wide RpcContext.memory built at daemon startup will not observe a config/set hot-update to the referenced provider profile until the daemon restarts or a memory-refresh hook (analogous to refresh_live_sessions_for_model_provider for chat) is added.

The author has been transparent about this limitation, explained the pre-existing pattern, and documented the startup-ordering constraint that prevents live resolution in a single PR. This is accepted as a scoped limitation for this fix. However, a follow-up issue tracking the memory-refresh hook (or an on-demand resolver pattern) should be filed and linked. The existing pattern of snapshot-at-build is pre-existing behavior, not newly introduced by this PR.


🟢 SSOT — no new persistent config field introduced

ResolvedEmbeddingConfig is a transient local, computed and consumed within create_memory_with_storage_and_routes. The new providers: Option<&ModelProviders> parameter is a borrow resolved at call time — it is NOT stored in any struct field. No struct in zeroclaw-memory or its callers gains a cached copy of provider state.


🟢 Error handling — loud WARN, stable error keys, not silently swallowed

Both failure modes emit structured WARN events through zeroclaw_log::record!:

  • error_key=memory.embedding_route_unresolved — ref not in providers.models
  • error_key=memory.embedding_route_no_endpoint — resolved family has no usable embeddings endpoint

These are log events (not user-facing CLI output), so English messages with stable error_key fields are correct per AGENTS.md (§ RFC #5653 §4.6). No fl!() required here.

The Rollback section correctly cites both error_key values as observable failure symptoms, matching the code.


🟢 Fluent / i18n — no new user-facing strings

No new CLI output, no new user-facing error messages. All new text is log-level WARN events with English messages and stable error_key attributes. Correct.


🟡 Warning — risk label missing; should be risk: high

The PR labels show daemon, gateway, memory, runtime with no risk label. Per AGENTS.md: "High risk: crates/zeroclaw-runtime/src/** (especially src/security/), crates/zeroclaw-gateway/src/** …" — this PR touches both. The author correctly identified risk: high in the PR body but lacks permission to apply the label. A maintainer should set risk: high and size: M before merge.


🟢 Rollback — complete, with SHAs and observable symptoms

git revert 9c9497fd4 14fe60917

Two self-contained commits, revert restores prior keyword-only behavior. Observable symptoms and feature-flag disable path ([memory].embedding_provider = "none") documented. Correct for a risk: high change.


🟢 Heartbeat worker now honors embedding routes

run_heartbeat_worker was previously using the bare create_memory entrypoint (empty routes, no catalog), causing [[embedding_routes]] to be ignored during heartbeat recall. Routing it through create_memory_with_storage_and_routes with config.embedding_routes and the catalog closes this gap. This was correctly self-caught and addressed in the follow-up commit.


🟢 Tests — thorough, mutation-checked, diagnostic asserted

  • 19 resolve_embedding_config* tests covering all code paths.
  • 7 new regression tests for dotted-ref resolution.
  • resolve_embedding_config_no_endpoint_emits_loud_warning captures the broadcast log event and asserts severity_text == "WARN" and attributes.error_key == "memory.embedding_route_no_endpoint" — this is the correct way to guard the "never silent" contract; a fallback-only assertion would stay green if the WARN were deleted.
  • Mutation checks performed: forcing is_dotted_ref = false fails 4 tests; restoring the old silent None => kind mapping fails the no-endpoint test.
  • 9435 total tests passing, 11 skipped.

🟢 PR template — complete

All required sections present: Summary (detailed scope boundary, blast radius), Validation Evidence (literal command output with full nextest run), Security & Privacy Impact (checklist, api_key handling correctly noted), Compatibility (breaking internal API change scoped and explained), Rollback.


🟢 CI — fully green

All 18 quality-gate checks pass: Format, Lint, Test, Build x86_64, Check matrix (32-bit, all-features, no-default, aarch64-apple-darwin, x86_64-pc-windows-msvc), Docs Style, Installer Drift, Nix Module Eval, Security, Zerocode RPC Boundary, CI Required Gate.


Summary

Verdict: --approve. The fix correctly resolves a dead-feature bug ([[embedding_routes]] dotted refs silently degrading to Noop), handles all relevant edge cases loudly, introduces no duplicate state, is thoroughly tested, and has a complete PR template and full green CI.

Two action items for maintainer before squash-merge:

  1. Apply risk: high and size: M labels (author flagged this; self-assessed correctly).
  2. File a follow-up issue tracking live provider-state propagation for the memory embedding backend (construction-time snapshot vs. config/set hot-reload), linking to the discussion in this PR's comments.

@WareWolf-MoonWall

Copy link
Copy Markdown
Collaborator

Labels note for maintainers: the diff touches memory + runtime + gateway resolution paths — the auto-applied labels show no risk: label. Per the risk tier table, this warrants risk: high. Please apply before merge.

@WareWolf-MoonWall WareWolf-MoonWall added this to the v0.8.2 milestone Jun 22, 2026
@Audacity88 Audacity88 added the bug Something isn't working label Jun 22, 2026
@Audacity88 Audacity88 removed this from the v0.8.2 milestone Jun 22, 2026
@Audacity88 Audacity88 added the config Auto scope: src/config/** changed. label Jun 22, 2026
@Audacity88 Audacity88 added this to the v0.8.3 milestone Jun 22, 2026
@Audacity88 Audacity88 added provider Auto scope: src/providers/** changed. provider:openai Auto module: provider/openai changed. provider:openrouter Auto module: provider/openrouter changed. risk: high Auto risk: security/runtime/gateway/tools/workflows. size: M Auto size: 251-500 non-doc changed lines. labels Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working config Auto scope: src/config/** changed. daemon Auto scope: src/daemon/** changed. gateway Auto scope: src/gateway/** changed. memory Auto scope: src/memory/** changed. provider:openai Auto module: provider/openai changed. provider:openrouter Auto module: provider/openrouter changed. provider Auto scope: src/providers/** changed. risk: high Auto risk: security/runtime/gateway/tools/workflows. runtime Auto scope: src/runtime/** changed. size: M Auto size: 251-500 non-doc changed lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: [[embedding_routes]] silently degrades to NoopEmbedding (route feature effectively dead)

3 participants